ABSTRACT

In this article, we present a new method termed CatFam (Catalytic
Families) to automatically infer the functions of catalytic proteins,
which account for 20-40% of all proteins in living organisms and play a
critical role in a variety of biological processes. CatFam is a
sequence-based method that generates sequence profiles to represent and
infer protein catalytic functions. CatFam generates profiles through a
stepwise procedure that carefully controls profile quality and employs
nonenzymes as negative samples to establish profile-specific thresholds
associated with a predefined nominal false-positive rate (FPR) of
predictions. The adjustable FPR allows for fine precision control of each
profile and enables the generation of profile databases that meet
different needs: function annotation with high precision and hypothesis
generation with moderate precision but better recall. Multiple tests of
CatFam databases (generated with distinct nominal FPRs) against enzyme
and nonenzyme datasets show that the method’s predictions have
consistently high precision and recall. For example, a 1% FPR database
predicts protein catalytic functions for a dataset of enzymes and
nonenzymes with 98.6% precision and 95.0% recall. Comparisons of CatFam
databases against other established profile-based methods for the
functional annotation of 13 bacterial genomes indicate that CatFam
consistently achieves higher precision and (in most cases) higher recall,
and that (on average) CatFam provides 21.9% additional catalytic
functions not inferred by the other similarly reliable methods. These
results strongly suggest that the proposed method provides a valuable
contribution to the automated prediction of protein catalytic functions.
The CatFam databases and the database search program are freely available
at http://www.bhsai.org/downloads/catfam.tar.gz.

INTRODUCTION

Continual advances in genome sequencing technology are driving an
exponential increase in the rate at which protein sequence data
accumulate. Unfortunately, our ability to experimentally ascertain the
function of protein sequences and annotate them has not increased at the
same rate, continually widening the gap between protein sequence data and
their functional annotation. Hence, although not perfect, computational
methods arguably offer the only feasible solution for addressing this
disparity and providing high-throughput annotation of protein function.
Among the various protein functions to be annotated, enzyme catalytic
functions are of great importance because about 20-40% of the genes in
genomes of the three domains of life code for enzymes, which play many
critical roles in a variety of biological processes in living organisms.

Traditionally, the computational prediction of protein catalytic
functions has been based on function transfer among homologous proteins,
which assumes that functions are shared among proteins with similar
sequences or structures. BLAST and other equivalent search methods have
enabled fast and efficient searches for similar sequences in large
databases. It has become common practice to perform BLAST searches of a
query protein against a function-annotated protein sequence database,
such as the Swiss-Prot database (http://expasy.org/sprot/), and transfer
the annotated proteins’ functions to the query protein for those proteins
that exceed a specified sequence similarity threshold (i.e., an E-value
cutoff). However, the accuracy of such methods is frequently questioned.
Although most proteins with high sequence similarity very likely share
similar functions, exceptions have been reported. Particularly for
enzymes, small changes in key residues have been shown to change protein
function.11,12 More accurate function predictions may be achieved by
structure-based homology methods. For example, George et al. proposed a
method based on the presence of particular amino acid residues at a few
active sites in the three-dimensional (3D) structure of a protein.
However, the lack of 3D structural information for the majority of the
sequenced proteins significantly limits its application. Homology-based
methods are also constrained by the limited number of homologous proteins
that have been well annotated and by the difficulty in making reliable
predictions for proteins with very low sequence similarity. Homology
methods can completely fail to determine “orphan enzyme” activity, that
is, a catalytic activity for which no sequence information is
available,14-16 or the activity of an “orphan gene” that has no
detectable homologs in other organisms.

An alternative approach that may complement homology-based methods relies
on ab initio methods. These employ statistical and machine-learning-based
techniques,17-22 such as Bayesian classification, decision trees,
association rules, neural networks, and support vector machines,19,20 to
classify protein catalytic functions using various features derived from
sequence and/or structure information of proteins with known functions.
These features include sequence-related physicochemical properties,18,19
such as polarity, hydrophobicity, van der Waals volume, and
glycosylation, as well as structure-related information, such as
predicted secondary structure.18,22 Despite the potential complementary
benefit of utilizing these approaches when homology methods fail, the
reported ab initio methods17-20,22 only predict the first two digits of
the four-digit Enzyme Commission (EC) number used for catalytic function
characterization. Therefore, given the ever-increasing amount and
availability of protein sequence data, improved sequence-based homology
methods currently provide the most practical computational solution for
predicting catalytic function on a genome-wide scale.

In one such effort, a novel probabilistic method was proposed23,24 to
improve catalytic function predictions based on BLAST searches of a
database of annotated enzymes. For a query enzyme, the method takes into
account all BLAST search results and employs Bayesian statistics to
determine the most probable EC number for the query enzyme. The method
predicts enzyme functions with high precision but is limited to cases
where the query protein is known to be an enzyme. A more general approach
that has been shown to significantly improve the accuracy of
sequence-based methods is to generate sequence “profiles” to characterize
the functions of similar protein sequences that share a common functional
annotation. Recently, two methods based on sequence profiles, PRIAM and
EFICAz,27,28 have been proposed for predicting protein catalytic
functions and have proven to be highly accurate in estimating catalytic
functions represented by EC numbers for a variety of enzymes.


PRIAM and EFICAz generate enzyme profiles by segregating enzymes with
known EC numbers gathered from the Swiss-Prot database. Enzymes that
share a common EC number are grouped together to generate one or more
profiles to characterize the EC number (or function) of the proteins in
the group. In PRIAM, the shortest sequence in an EC group is selected as
the seed for PSI-BLAST searches against the proteins in the group. The
searches generate a sequence profile, which is represented in the form of
a Position Specific Scoring Matrix (PSSM), and identify enzymes in the
group that are similar to the profile. The identified enzymes are then
removed from the group, and the process is repeated, each time generating
a new profile, until all enzymes in the group have been removed. For all
profiles, PRIAM uses the same E-value cutoff to determine whether to
transfer the function of the profile to the query protein. Conversely,
rather than sequentially generating multiple profiles for each EC number,
EFICAz first clusters enzymes by sequence similarity within an EC number
group and then uses Hidden Markov Models to generate profiles for each
cluster of enzymes. This reduces the possibility of separating proteins
with very similar sequences in the generation of multiple profiles. More
importantly, EFICAz employs sequence identity and negative samples, that
is, proteins associated with functions different from the considered
function, to establish a specific cutoff for each profile. This can be
effective in reducing excessive false predictions for particular EC
numbers, whereas methods that employ a single cutoff for all profiles
(e.g., PRIAM) can only assure an overall, average performance.

In this article, we present a new method termed CatFam (Catalytic
Families) to automatically infer protein catalytic functions using
sequence profiles. CatFam’s profile generation procedure is similar to
that of EFICAz. It uses a hierarchical clustering algorithm to cluster
enzyme sequences and employs negative samples to generate
profile-specific cutoffs that determine whether to transfer the function
of the profile to the query protein. However, CatFam employs ClustalW and
PSI-BLAST to generate profiles in PSSM format and uses a stepwise
procedure to control the quality of a profile during its generation. More
importantly, unlike EFICAz, which uses sequence identity between the
query protein and the sequences used to generate the profile to determine
whether to transfer the function of the profile to the query protein,
CatFam uses the raw score threshold (RST) of the profile itself. In
contrast to sequence identity, the raw score of the sequence-profile
alignment provides a direct measure of the similarity between the query
protein and the enzyme profiles characterizing the catalytic functions.
Furthermore, this direct measure obviates the need to maintain a sequence
database of the enzymes used to generate the profiles, which is needed to
compute the sequence identity of the query protein. Moreover, because the
RST is associated with predefined nominal false positive rates (FPRs), it
enables the generation of distinct profile databases with different
levels of precision and recall, a capability that other prediction
methods have yet to implement.

Next, we present the CatFam profile generation algorithm. Then, we assess
the performance of the CatFam enzyme profile databases in various test
cases by comparing them against BLAST and the two well-established
profile-based methods, PRIAM and EFICAz, for protein catalytic function
prediction. Finally, we conclude by summarizing the major features of
CatFam and contrasting it against the two profile-based methods.

METHOD 

Data  preparation 

We employ enzyme and nonenzyme protein data annotated in the Swiss-Prot
database to construct datasets for the generation and testing of CatFam
databases. The enzyme data consist of protein sequences and their
corresponding EC numbers. The EC numbers are consistent with records in
the Enzyme Nomenclature Database (http://www.expasy.ch/enzyme/), which
cross-reference all enzymes in Swiss-Prot. We label proteins as
nonenzymes by following a rule adapted from that used by EFICAz: a
protein in Swiss-Prot is classified as a nonenzyme if no EC number, no
enzyme keywords, and no words indicating less reliable function
annotation, such as hypothetical and putative, are associated with the
protein.
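The nonenzyme labeling rule can be sketched as a simple predicate. This is only an illustration under stated assumptions: the `Entry` record, its field names, and the `ENZYME_KEYWORDS` list are hypothetical stand-ins rather than the actual Swiss-Prot record layout or the keyword list used by CatFam or EFICAz; only the words "hypothetical" and "putative" come from the text.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for a Swiss-Prot entry (assumed field names).
@dataclass
class Entry:
    accession: str
    description: str
    ec_numbers: list = field(default_factory=list)
    keywords: list = field(default_factory=list)

# Words signalling less reliable annotation; the text names these two.
UNRELIABLE_WORDS = ("hypothetical", "putative")
# Assumed, illustrative enzyme keywords (the six top-level enzyme classes).
ENZYME_KEYWORDS = {"oxidoreductase", "transferase", "hydrolase",
                   "lyase", "isomerase", "ligase"}

def is_nonenzyme(entry):
    """Return True only if none of the three exclusion criteria applies."""
    if entry.ec_numbers:                       # an EC number is present
        return False
    if any(k.lower() in ENZYME_KEYWORDS for k in entry.keywords):
        return False                           # an enzyme keyword is present
    text = entry.description.lower()
    return not any(w in text for w in UNRELIABLE_WORDS)
```

Under this rule, a protein carrying an unreliable annotation is neither an enzyme nor a nonenzyme for training purposes; it is simply left out of both sets.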

We assume that the manual annotations in the Swiss-Prot database provide
an appropriate “gold standard” to train CatFam. Although errors
inevitably exist in this database, a recent study indicates that most
errors are due to under-annotation, that is, missed enzyme annotations,
and a substantial number of such omissions will be corrected in the next
Swiss-Prot release. Moreover, the detrimental effect of sporadic
annotations of wrong protein functions for a small number of enzymes can
be reduced when they are merged with a large number of correctly
annotated enzymes to generate a sequence profile. In combination with
precision control during profile generation, this helps to ensure that
annotations performed by CatFam do not lead to over-predictions, which
are the most detrimental type of errors propagated in databases.

The primary dataset used in this study consists of 189,178 proteins
(75,687 enzymes and 113,491 nonenzymes) from Swiss-Prot released in
November 2006. About 90% of the enzymes and nonenzymes are randomly
selected to form a training dataset Dtr to generate CatFam databases. The
remaining proteins, 7600 enzymes and 11,349 nonenzymes, are set aside to
form a testing enzyme dataset Denz1 and a testing nonenzyme dataset Dnz,
respectively. Using the latest Swiss-Prot release (August 2007), we form
a secondary testing enzyme dataset Denz2, consisting of 8399 newly added
enzymes. We use Denz2 as a surrogate to assess how well the CatFam
databases can predict the catalytic functions of future, as yet
unannotated proteins.

Enzyme  profile  database  generation 

A sequence profile generated from protein sequences of a common function
reveals the functionally conserved amino acid patterns of the sequences.
Hence, a protein that matches such a profile can be annotated by the
function associated with the profile. We generate profiles from enzymes
that are annotated with the same EC number in the training dataset Dtr.
For each EC number g, one or more profiles are generated by the following
procedure:

a. Create a subset of enzymes Dg from Dtr consisting of enzymes with EC
number g.

b. Compute the sequence similarity between each pair of enzymes in Dg.
This is performed by all-against-all BLAST searches, where sequence
similarity is measured based on the E-value of the alignment of each pair
of sequences.

c. Cluster enzymes by their sequence similarity, that is, E-value, using
a hierarchical clustering algorithm. Initially, each sequence forms a
cluster. Then, we perform a pairwise search among all clusters and merge
the two clusters, Ci and Cj, that have the smallest cost function

$$F(C_i, C_j) = \max\left[\, E(a, b),\ \forall a \in C_i,\ \forall b \in C_j \,\right], \qquad (1)$$

into one cluster. Here, E(a,b) denotes the E-value between protein
sequences a and b in clusters Ci and Cj, respectively. Next, we continue
this merging procedure until the cost function F exceeds a preset limit,
at which point we have partitioned Dg into a total of K distinct clusters
Ck, k = 1,2,...,K. The proteins in each cluster are used to initialize a
profile-generation set Sk for cluster Ck.

d. Generate one profile for cluster Ck. This is achieved by using
ClustalW to perform multiple sequence alignment (MSA) for protein
sequences in set Sk, followed by PSI-BLAST searches to generate a PSSM
format profile Pk,m, m = 1,2,...,M, where M is the total number of
profiles used to represent cluster Ck.

e. Expand the set Sk by adding one additional sequence s from Dg. The
expanded set is used to generate a new profile for Ck. The added sequence
s is selected as the one that is most similar to all sequences already in
Sk, that is, the sequence that has the smallest cost function

$$F(s, S_k) = \max\left[\, E(s, b),\ \forall b \in S_k \,\right]. \qquad (2)$$

This gradual addition of divergent sequences preserves the quality of the
MSA used to generate the profile for Ck.

f. Return to Step (d) to generate another profile Pk,m for Ck unless one
of the two following conditions that terminate the expansion of the set
Sk in Step (e) is met: (1) there are no remaining proteins in Dg or (2)
the MSA does not have a single fully conserved position. The second
condition prevents the addition of proteins to Sk that are too divergent
and would have significantly lowered the quality of the generated profile
for Ck.

g. Select the best profile for cluster Ck. A series of profiles Pk,1,
Pk,2, ..., Pk,M is generated through iterations of Steps (d) and (e). For
each profile, we first use PSI-BLAST to align all proteins in Dtr with
that profile. Then, we rank order the proteins according to their raw
score and compute the FPR (i.e., the fraction of proteins with an EC
number other than g) associated with each score. Next, starting with the
largest raw score, we search through the list of ranked proteins and
identify the one associated with the largest FPR that passes the
predefined threshold (the nominal FPR). The corresponding raw score is
labeled the RST, which is used as the cutoff for the profile. Finally,
for each profile, we compute the number of true positives (i.e., the
number of proteins with EC number g) associated with the corresponding
RST and select as the best profile for cluster Ck the one with the
largest number of true positives.

h.  Return  to  Step  (d)  to  generate  a  profile  for  another 
cluster  Ck  until  the  procedure  is  completed  for  all  K 
clusters. 
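Steps (c) and (e) can be sketched in a few lines. This is a minimal illustration, not the CatFam implementation: it assumes the all-against-all BLAST results are available as a dictionary of pairwise E-values, and it assumes that a pair without a reported hit is treated as infinitely dissimilar. The function names are hypothetical.

```python
def pair_evalue(evalue, a, b):
    """E-value of the pair (a, b); missing pairs count as infinite (assumption)."""
    return evalue.get((a, b), evalue.get((b, a), float("inf")))

def complete_linkage_clusters(ids, evalue, limit):
    """Step (c): merge the two clusters with the smallest cost F (the largest
    pairwise E-value between their members) until F exceeds `limit`."""
    clusters = [[i] for i in ids]          # initially, each sequence is a cluster

    def cost(ci, cj):
        return max(pair_evalue(evalue, a, b) for a in ci for b in cj)

    while len(clusters) > 1:
        f, i, j = min(
            (cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if f > limit:                      # preset limit reached: stop merging
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def next_sequence(members, candidates, evalue):
    """Step (e): pick the candidate whose worst E-value against the current
    profile-generation set is smallest."""
    return min(
        candidates,
        key=lambda s: max(pair_evalue(evalue, s, b) for b in members),
    )
```

Because the cost is the maximum over all cross-cluster pairs, a single distant pair is enough to keep two clusters apart, which is what keeps each cluster compact enough for a clean multiple sequence alignment.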

This  procedure  allows  the  user  to  define  the  nominal 
FPR  for  each  EC  number  during  the  generation  of  the 
CatFam  databases.  Therefore,  the  user  can  select  low 
FPRs  to  generate  databases  with  highly  accurate  enzyme 
annotation  or  high  FPRs  to  construct  databases  with 
high  recall. 
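The threshold selection in Step (g) can be illustrated as follows. This sketch rests on one reading of the procedure: the FPR at a given raw score is taken to be the fraction of proteins scoring at or above it whose EC number differs from g, and the RST is taken to be the lowest score at which that fraction still does not exceed the nominal FPR. The function names and the input layout (a list of raw score and EC number pairs per profile) are assumptions made for illustration.

```python
def raw_score_threshold(hits, target_ec, nominal_fpr):
    """Return (RST, true positives at RST), or (None, 0) if no score qualifies.

    hits: list of (raw_score, ec_number) pairs; ec_number is None for nonenzymes.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    rst, tp_at_rst = None, 0
    tp = fp = 0
    for idx, (score, ec) in enumerate(ranked):
        if ec == target_ec:
            tp += 1
        else:
            fp += 1
        # Evaluate only after the last protein of a group of tied scores.
        if idx + 1 < len(ranked) and ranked[idx + 1][0] == score:
            continue
        if fp / (tp + fp) <= nominal_fpr:
            rst, tp_at_rst = score, tp
    return rst, tp_at_rst

def select_best_profile(hits_by_profile, target_ec, nominal_fpr):
    """Pick the profile that recovers the most true positives at its own RST."""
    best = None
    for name, hits in hits_by_profile.items():
        rst, tp = raw_score_threshold(hits, target_ec, nominal_fpr)
        if rst is not None and (best is None or tp > best[2]):
            best = (name, rst, tp)
    return best
```

Raising `nominal_fpr` lets the threshold slide further down the ranked list, which trades precision for recall in the way the text describes for the different CatFam databases.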

RESULTS 

We assess the performance of the CatFam databases by comparing and
contrasting them against well-established resources for predicting
protein catalytic function, such as BLAST, PRIAM, and EFICAz. The
availability of BLAST and PRIAM’s source code allows us to comparatively
assess CatFam’s performance for the three customized testing datasets,
Denz1, Denz2, and Dnz, discussed earlier. Conversely, due to the
unavailability of the EFICAz code, comparisons with it are limited to
precomputed enzyme functions available on its website in September 2007
(http://cssb2.biology.gatech.edu/EFICAz/).

CatFam databases 

To test the performance of the proposed enzyme-prediction algorithm, we
construct four CatFam databases, consisting of enzyme profiles and their
associated EC numbers and raw score thresholds. We construct each of the
four databases to satisfy one nominal FPR (1, 5, 10, and 25%) specified
during profile generation and test their ability to predict enzymes
labeled by four-digit EC numbers. Table I lists the distribution of
distinct EC numbers used in the development of the CatFam databases.

Table I

Distribution of EC Numbers Used in the Development of the CatFam
Databases

Distinct EC numbers in the Swiss-Prot database (a)    2220
Distinct EC numbers in the training dataset Dtr       1885
Number of profiles in the CatFam database (b)         8080
Distinct EC numbers in the CatFam database (b)        1653
Distinct EC numbers in Denz1                           856
Distinct EC numbers in Denz2                           545

(a) Swiss-Prot database released in November 2006.
(b) These numbers correspond to the CatFam database with 1% false
positive rate. The numbers for other CatFam databases are slightly
larger.

Out of the 2220 distinct EC numbers in the Swiss-Prot database, 1885
(85%) are covered in the training dataset Dtr. Each of the 335 EC numbers
not covered corresponds to only one enzyme in the Swiss-Prot database,
making them unsuitable for profile generation. For a 1% FPR, CatFam
generates 8080 profiles for 1653 EC numbers, comprising 88% of the EC
numbers in Dtr. For the remaining 12%, profiles are not generated because
of the insufficient number of training enzymes. For the other FPRs,
CatFam generates a similar number of profiles, covering a comparable
number of EC numbers. Because the testing dataset Denz1 is created by
randomly selecting 10% of the enzymes for each EC number, no testing
enzymes are selected for EC numbers associated with fewer than 10
enzymes. Thus, the testing dataset Denz1 only covers about half of the EC
numbers in the CatFam databases. Interestingly, the dataset Denz2, which
contains newly annotated enzymes, has even fewer distinct EC numbers than
Denz1.
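The per-EC-number selection of testing enzymes explains why sparsely populated EC numbers are absent from the testing dataset. The sketch below is illustrative only: it assumes that the 10% share is rounded down within each EC number, that each protein is listed with a single EC number, and that a fixed random seed is used; `split_by_ec` is a hypothetical helper, not part of CatFam.

```python
import random
from collections import defaultdict

def split_by_ec(enzymes, test_percent=10, seed=0):
    """Split (protein_id, ec_number) pairs into training and testing lists,
    drawing the testing share separately within each EC number."""
    by_ec = defaultdict(list)
    for protein_id, ec in enzymes:
        by_ec[ec].append(protein_id)

    rng = random.Random(seed)
    train, test = [], []
    for ec in sorted(by_ec):
        ids = sorted(by_ec[ec])
        rng.shuffle(ids)
        # Integer arithmetic rounds down: fewer than 10 enzymes yields 0.
        n_test = len(ids) * test_percent // 100
        test += [(p, ec) for p in ids[:n_test]]
        train += [(p, ec) for p in ids[n_test:]]
    return train, test
```

With this rounding, an EC number represented by nine enzymes contributes nothing to the testing set, whereas one with twenty enzymes contributes two.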

Assessment  of  CatFam’s  performance 

We first assess the capability of the CatFam databases to discriminate
between enzymes and nonenzymes and compare their performance against
BLAST searches on the training dataset Dtr. A query protein is labeled as
an enzyme if an EC number is assigned to it by CatFam or if a BLAST
search finds an enzyme in Dtr with an E-value less than a given cutoff.
Figure 1 shows the combined results against the enzyme (Denz1) and
nonenzyme (Dnz) datasets. As expected, smaller E-value cutoffs in BLAST
searches decrease the false identification of nonenzymes, while
increasing the misidentification of enzymes. The results of the CatFam
databases with decreasing FPRs yield a similar trend. However, when
compared with BLAST, for a fixed number of misidentified enzymes, CatFam
yields a much smaller number of falsely identified nonenzymes. This
suggests that the CatFam profiles are effective in characterizing enzyme
catalytic functions and in distinguishing enzymes from nonenzymes.

Figure 1

Comparison of CatFam databases and BLAST searches for the discrimination
of enzymes and nonenzymes in datasets Denz1 and Dnz. A query protein is
labeled as an enzyme if an EC number is assigned by CatFam or if a BLAST
search against the training dataset Dtr finds an enzyme with an E-value
less than a given cutoff. The figure shows some of these E-value cutoffs.

We further assess the catalytic function prediction of the four CatFam
databases by computing precision and recall for the two testing enzyme
datasets, Denz1 and Denz2, and for the nonenzyme dataset, Dnz, and
comparing the results against PRIAM’s predictions. Precision and recall
are defined as:

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

where TP, FP, and FN denote true positives, false positives, and false
negatives, respectively. For each testing set, TP is the number of
predicted EC numbers that are consistent with the proteins’ original EC
number assignments, FP is the number of predicted EC numbers that do not
match the proteins’ original assignments, and FN is the number of
originally assigned EC numbers that are not predicted.
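These definitions count EC number assignments rather than proteins, so a nonenzyme that receives a prediction adds to FP without adding to FN. A minimal sketch of the bookkeeping follows, assuming predictions and annotations are held as dictionaries that map a protein identifier to a set of EC numbers (an illustrative data layout, not one prescribed by the method).

```python
def precision_recall(predicted, annotated):
    """Compute (precision, recall) over EC number assignments.

    predicted, annotated: dict mapping protein id -> set of EC numbers.
    Nonenzymes appear in `annotated` with an empty set.
    """
    tp = fp = fn = 0
    for protein_id in set(predicted) | set(annotated):
        pred = predicted.get(protein_id, set())
        true = annotated.get(protein_id, set())
        tp += len(pred & true)   # predicted and originally assigned
        fp += len(pred - true)   # predicted but not originally assigned
        fn += len(true - pred)   # originally assigned but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Adding nonenzymes to a testing set can therefore lower precision but leaves recall unchanged, which is the pattern seen when the nonenzyme dataset is combined with an enzyme dataset.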

Table II compares the performance of the two methods for the different
testing datasets. The results for the two testing enzyme datasets
indicate each method’s performance for the case where all proteins are
enzymes. Situations like these may arise in genome reannotations
performed to update or refine existing enzyme annotations. The results
for the dataset that combines Denz1 and Dnz mimic the performance of
automated protein annotations for newly sequenced genomes, involving both
enzymes and nonenzymes.

As expected, the CatFam databases consistently achieve higher precision
and lower recall with smaller preset FPRs. In the case of Denz1, when the
FPR changes from 25% to 1%, precision increases from 96.0 to 99.2% and
recall decreases from 97.6 to 95.0%. When compared with Denz1, the
precision of each CatFam database for Denz2, which consists of enzymes
recently annotated and added to the Swiss-Prot database, drops by less
than 1.0%. This suggests a consistently high reliability of CatFam’s
predictions. Conversely, recall for Denz2 drops slightly more than 10.0%
for each of the CatFam databases. We attribute the lower recall to the
fact that the CatFam databases, trained on a previous release of
Swiss-Prot, cannot characterize new sequence patterns in enzymes recently
added to the Swiss-Prot database. This is supported by our observation
that there are as many as 665 (8%) proteins in Denz2, compared with 51
(0.7%) proteins in Denz1, that have less than 15% sequence similarity
with the enzymes used to train the CatFam profiles, as shown in Figures
2(a,b). In addition, if enzymes in Denz2 have catalytic functions
associated with orphan enzymes or orphan genes in the previous Swiss-Prot
database, they will not be predicted either. When comparing the precision
for the composite dataset that combines both enzymes and nonenzymes,
Denz1 + Dnz, with that for Denz1, we find that the addition of nonenzymes
causes a slight decrement in precision, monotonically decreasing it by
1.5-0.6% as the preset FPR changes from 25% to 1%. This further indicates
that CatFam databases can accurately discriminate enzymes from
nonenzymes.

Comparisons between PRIAM and the CatFam databases show that both PRIAM’s
precision and recall are consistently and systematically lower than those
of all four CatFam databases for all testing datasets. Although PRIAM’s
precision for the enzyme datasets, Denz1 and Denz2, is only about 4.0%
lower than CatFam’s results for 25% FPR, its recall is about 10.0% lower.
Consistently, PRIAM’s precision and recall for the composite dataset,
Denz1 + Dnz, are about 10.0% lower than those for CatFam with 25% FPR.
These results clearly suggest that CatFam outperforms PRIAM in
discriminating between enzymes and nonenzymes.

The performance of sequence-based protein function annotation methods is
highly dependent on the sequence identity between the query protein and
the proteins with known function used for method development. To compare
the performance of CatFam and PRIAM as a function of protein sequence
identity, we sort the testing results according to the maximum sequence
identity (MSI) between the query proteins and the proteins used for
profile generation. In this comparison, we assume that the proteins used
to construct PRIAM, whose distribution we do not know, would result in
similar MSIs when queried against Denz1 and Denz2. This assumption is
reasonable because both PRIAM and CatFam are based on the Swiss-Prot
database, and their release dates are within 4 months of each other. The
resulting distributions of the query proteins in these two datasets
against the proteins used for profile generation in CatFam are shown in
Figures 2(a,b), respectively, where the MSIs are binned in 5% increments,
ranging from 15 to 95%. Figures 2(c-g) show precision and recall results
for PRIAM and the 1 and 25% FPR CatFam databases as a function of MSI.
The performance of the 5 and 10% FPR CatFam databases is not shown
because they have similar trends and are bounded by the presented plots.

Table II

Comparison of Catalytic Function Predictions of Four CatFam Databases
Versus PRIAM, Using Two Testing Enzyme Datasets, Denz1 and Denz2, and One
Nonenzyme Dataset, Dnz

                             CatFam, preset false positive rate (FPR)
Dataset      Metric            1%      5%      10%     25%     PRIAM
Denz1        Precision (%)     99.2    98.5    97.2    96.0    93.4
             Recall (%)        95.0    96.4    97.0    97.6    87.9
Denz2        Precision (%)     99.0    97.9    96.6    95.3    91.4
             Recall (%)        82.3    84.5    86.3    87.4    76.3
Denz1 + Dnz  Precision (%)     98.6    97.5    95.9    94.5    82.6
             Recall (%)        95.0    96.4    97.0    97.6    87.9

Figure 2

Precision and recall as a function of maximum sequence identity (MSI) for
PRIAM and two catalytic family (CatFam) databases, one with 1% false
positive rate (FPR) and another with 25% FPR. The MSI distributions for
proteins in the testing datasets Denz1 and Denz2 are shown in Figure
2(a,b), respectively.

We observe similar trends for all precision curves: precision remains
high and does not significantly decrease until the MSI decreases to a
turning point, from which precision decreases with decreasing MSI. This
point corresponds to 25-30% MSI for the CatFam databases and about 35-40%
MSI for PRIAM. In particular, for query proteins with more than 30% MSI,
catalytic function predictions with the 1% FPR CatFam database can
achieve better than 93.0% precision and as much as 98.0% if it is known
[as in Figures 2(c,d)] that the query proteins are enzymes. For the
CatFam database with 25% FPR, while still high, precision is reduced to
87.0% and 93.0%, respectively, for the composite dataset and enzyme
datasets. PRIAM’s precision is consistently below the corresponding
values for the two CatFam databases. The plots for recall shown in
Figures 2(f,g) show similar behavior. For Denz1, the 1% FPR CatFam
database can achieve better than 89.0% recall when the MSI is larger than
45%, whereas such recall is achieved when the MSI is larger than 35% for
the database with 25% FPR. For Denz2, the MSI needs to be larger than 50%
for the two CatFam databases to achieve better than 89.0% recall. PRIAM’s
recall for either of the two testing datasets rarely reaches the 89.0%
mark.

Predictions  for  multi-domain  and 
multi-functional  enzymes 

CatFam was not developed to identify functional domains in
protein sequences or to reveal their connections with
catalytic functions. However, we find that the clustering step
in the profile generation process does classify
multi-functional enzymes, which quite likely contain multiple
domains. For example, the training set of 395 enzymes used to
generate profiles for EC 2.7.7.48 (RNA-directed RNA
polymerase) includes 246 multi-functional enzymes, which are
grouped into 47 clusters with different catalytic functions in
addition to EC 2.7.7.48. One such cluster contains 20 enzymes
with EC 2.1.1.56, and another contains 22 enzymes with EC
3.1.3.33. Out of the 47 clusters, only three contain enzymes
that do not share other EC numbers. This suggests that
multi-functional enzymes sharing a common secondary function
are further grouped into clusters according to their shared
secondary functions or, perhaps, different domain structures.
In addition, training enzymes with multiple functions are used
to generate profiles related to each of their EC numbers. This
enables the determination of all EC numbers for a
multi-functional enzyme. For example, the 1% FPR CatFam
database correctly predicts 569 (91%) out of 622 EC numbers
that are assigned to 283 multi-functional enzymes in the
testing enzyme dataset Denz1, with only three false
predictions.

Analysis  of  false  predictions 

We analyze the small number of false predictions made by the
1% FPR CatFam database for enzymes in the Denz1 dataset and
for nonenzymes in the Dnz dataset. For the 7600 enzymes in the
Denz1 dataset, CatFam incorrectly predicts 58 EC numbers (or
0.76%) for 57 enzymes. Table III lists these 58 predictions
along with their true EC annotations based on the Swiss-Prot
database (November 2006 release). These correspond to 36
distinct EC numbers out of a total of 856 distinct EC numbers
in Denz1.

Our analysis finds that eight out of the 58 EC predictions
(highlighted with bold font in the table) are in fact correct.
These include two EC numbers predicted for protein DPOL_HBVAW,
whose annotation has been revised in the most recent
Swiss-Prot database (February 2008). The other six predicted
EC numbers represent catalytic activities that either subsume
or are subsumed by the EC numbers assigned by Swiss-Prot. For
example, EC 2.4.1.21 (starch synthase using ADP-glucose),
predicted for proteins SSG1_HORVU and SSG1_MANES, is subsumed
by EC 2.4.1.242 (starch synthase using either ADP-glucose or
UDP-glucose). In another example, EC 2.7.1.1 (hexokinase) is
predicted for protein HXK4_RAT, which subsumes EC 2.7.1.2
(glucokinase). For the other 50 false predictions, we find
that 39 of them correspond to EC numbers (and functions) that
are very similar to the true annotations. The differences are
only at the substrate or cofactor levels, which are usually
reflected in the last digit of the four-digit EC number
annotation. Such differences account for 30 out of these 39
false predictions. The other nine false predictions are also
associated with substrate-level inferences, although they
involve errors in multiple EC digits. In eight of these cases,
CatFam misidentifies NADH dehydrogenase that acts on quinone
(EC 1.6.99.5) as NADH dehydrogenase that acts on ubiquinone
(EC 1.6.5.3), and in one case CatFam makes the converse
mistake, that is, it predicts EC 1.6.99.5 for EC 1.6.5.3.
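
The distinction between last-digit and multiple-digit errors can be made concrete by locating the first EC level at which a predicted number departs from the true one. The helper below is a hypothetical sketch (the name `first_differing_level` is not from the paper); the example pairs are taken from Table III.

```python
def first_differing_level(predicted_ec, true_ec):
    """Return the first EC level (1-4) at which two EC numbers differ.

    Returns None when the two numbers are identical.
    """
    pairs = zip(predicted_ec.split("."), true_ec.split("."))
    for level, (p, t) in enumerate(pairs, start=1):
        if p != t:
            return level
    return None


# Last-digit (substrate or cofactor level) difference:
print(first_differing_level("1.1.1.37", "1.1.1.27"))  # 4
# Multiple-digit difference, as in the NADH dehydrogenase cases:
print(first_differing_level("1.6.5.3", "1.6.99.5"))   # 3
```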

It should be noted that CatFam does not make systematic errors
for particular EC numbers, in that one EC number is not always
predicted for another. This is observed in Table III, which
shows that only eight EC numbers (underlined) have error rates
higher than 10%. Most of these EC numbers are underrepresented
in both the training and the testing datasets, significantly
contributing to the higher error rates.

For the 11,349 nonenzymes in the Dnz dataset, CatFam
incorrectly provides an EC number for 47 (or 0.41%), which are
distributed over 27 distinct EC numbers. Table IV lists the
six EC numbers that are incorrectly assigned to more than one
nonenzyme and the corresponding number of enzymes used to
train the related CatFam profiles. Except for EC 2.4.1.129,
the number of nonenzymes incorrectly predicted is roughly
proportional to the number of training enzymes. This is a
consequence of the constraints imposed by the specified FPR
during the training process. For a large number of training
enzymes, a fixed rate of acceptable false predictions yields a
large number of incorrectly predicted nonenzymes. The false
predictions related to EC 2.4.1.129 are attributed to the
small number of proteins (13) used to construct its profile.
In addition, CatFam occasionally predicts EC numbers for close
homologs of enzymes that do not possess catalytic activity, as
they lack the necessary active sites. For example, CatFam
predicts EC 3.2.1.17 for proteins SACA3_HUMAN and SACA_MOUSE,
which have 50% sequence similarity with the training enzymes.
However, these two proteins lack catalytic activity because
the required residues at positions 122 (Glu) and 139 (Asp) are
not conserved. In another case, CatFam predicts EC
Table III

Analysis of 58 Incorrect EC Predictions for the Testing Enzyme Dataset Denz1

Protein accession (a)  Predicted EC  True EC (b)  Error rate (%) (c)  Catalytic function description (d)

LDH_BOTBR              1.1.1.37      1.1.1.27     9.0                 L-lactate dehydrogenase [Malate dehydrogenase]
LDH_THEMA              1.1.1.37      1.1.1.27
GPDA_TRYBB             1.1.1.94      1.1.1.8      4.0                 Glycerol-3-phosphate dehydrogenase (NAD(+)) [Glycerol-3-phosphate dehydrogenase (NAD(P)(+))]
NUOH1_RHOPB            1.6.5.3       1.6.99.5     6.0                 NADH dehydrogenase (quinone) [NADH dehydrogenase (ubiquinone)]
NUOH_AZOSE             1.6.5.3       1.6.99.5
NUOH_BORPA             1.6.5.3       1.6.99.5
NUOH_RHORT             1.6.5.3       1.6.99.5
NUOH_THICR             1.6.5.3       1.6.99.5
NUOI1_RHOS4            1.6.5.3       1.6.99.5
NUOI_RHORT             1.6.5.3       1.6.99.5
NUOK_RHOCA             1.6.5.3       1.6.99.5
NUOG_RICCN             1.6.99.3      1.6.99.5     5.0                 NADH dehydrogenase (quinone) [NADH dehydrogenase]
NUOG_RICPR             1.6.99.3      1.6.99.5
NDUS2_RECAM            1.6.99.5      1.6.99.3     5.0
NU1M_METSE             1.6.99.5      1.6.5.3
ODP2_ACHLA             2.3.1.61      2.3.1.12     33 (e)              Dihydrolipoyllysine-residue acetyltransferase [Dihydrolipoyllysine-residue succinyltransferase]
AMY_BACCI              2.4.1.19      3.2.1.1      50                  1,4-α-D-glucan glucanohydrolase [Cyclodextrin glucanotransferase]
SSG1_HORVU (f)         2.4.1.21      2.4.1.242                        Starch synthase that uses either UDP- or ADP-glucose [Starch synthase that uses ADP-glucose]
SSG1_MANES             2.4.1.21      2.4.1.242
APT_YERPE              2.4.2.10      2.4.2.7      9.0                 Adenine phosphoribosyltransferase [Orotate phosphoribosyltransferase]
OAT_OCEIH              2.6.1.11      2.6.1.13     8.0                 Ornithine aminotransferase [Acetylornithine aminotransferase]
HXK4_RAT               2.7.1.1       2.7.1.2                          Glucokinase [Hexokinase]
FER_HUMAN              2.7.10.1      2.7.10.2     4.0                 Protein-tyrosine kinase [Protein-tyrosine kinase with an additional transmembrane domain]
PPK5_SCHPO             2.7.12.1      2.7.11.1                         Nonspecific serine/threonine protein kinase [Dual-specificity kinase for both serine/threonine and tyrosine]
KAPB_YEAST             2.7.11.1      2.7.11.11    4.0                 cAMP dependent protein kinase
KPCD_CANFA             2.7.11.1      2.7.11.13                        Calcium-dependent protein kinase
PLK4_MOUSE             2.7.11.1      2.7.11.21                        Polo serine/threonine protein kinase, catalyzes same reaction but associates with the spindle pole
PSK1_SCHPO             2.7.11.1      2.7.11.22                        Cyclin-dependent protein kinase
ARGA_PSESM             2.7.2.8       2.3.1.1      6.0                 Amino-acid N-acetyltransferase [Acetylglutamate kinase]
DPOL_HBVAW             2.7.7.49      2.7.7.49                         RNA-directed DNA polymerase
DPOL_HBVAW             3.1.26.4      3.1.26.4                         Ribonuclease
UBC2_MIMIV             2.7.7.6       6.3.2.19     1.0                 Ubiquitin protein ligase [DNA-directed RNA polymerase]
TREX2_HUMAN            2.7.7.7       3.1.11.2     1.0                 3'-5' exonuclease [DNA-directed DNA polymerase]
MGTA_THENE             3.2.1.1       2.4.1.25     14                  4-α-glucanotransferase [α-amylase]
GUX6_HUMIN             3.2.1.4       3.2.1.91     12                  Exoglucanase [Endoglucanase]
GUX_CELFI              3.2.1.4       3.2.1.91
GUNB_CALSA             3.2.1.8       3.2.1.4      11                  Endoglucanase [Endo-1,4-β-xylanase]
ATPL_PROMO             3.6.3.14      3.6.3.15     0.4                 Sodium ion specific ATP synthase [ATP synthase]
ULAD_MYCPN             4.1.1.23      4.1.1.85     3.0                 3-dehydro-L-gulonate-6-phosphate decarboxylase [Orotidine-5'-phosphate decarboxylase]
TRPF_KLULA             4.1.1.48      5.3.1.24     18                  Phosphoribosylanthranilate isomerase [Indole-3-glycerol-phosphate synthase]
TRPF_ZYGBA             4.1.1.48      5.3.1.24
PABB_BACSU             4.1.3.27      2.6.1.85     9.0                 Aminodeoxychorismate synthase [Anthranilate synthase]
LYS4_SCHPO             4.2.1.33      4.2.1.36     3.0                 Homoaconitate hydratase [3-isopropylmalate dehydratase]
ISPD_RHOS4             4.6.1.12      2.7.7.60     4.0                 4-diphosphocytidyl-2-C-methyl-D-erythritol synthase [2-C-methyl-D-erythritol-2,4-cyclodiphosphate synthase]
PHEA_METJA             5.4.99.5      4.2.1.51     50                  Prephenate dehydratase [Chorismate mutase]
SYWC_BOVIN             6.1.1.1       6.1.1.2      14                  Tryptophan translase [Tyrosine translase]
SYWC_MOUSE             6.1.1.1       6.1.1.2
SYWC_RABIT             6.1.1.1       6.1.1.2
SYW_CLOLO              6.1.1.1       6.1.1.2
SYN_PYRKO              6.1.1.12      6.1.1.22     4.0                 Asparagine translase [Aspartic acid translase]
SYT_THET8              6.1.1.15      6.1.1.3      3.0                 Threonine translase [Proline translase]
SYQ_CLOPE              6.1.1.17      6.1.1.18     5.0                 Glutamine translase [Glutamic acid translase]
SYQ_PSESM              6.1.1.17      6.1.1.18
SYK_STRMU              6.1.1.20      6.1.1.6      4.0                 Lysine translase [Phenylalanine translase]
SYMC_CAEEL             6.1.1.20      6.1.1.10     4.0                 Methionine translase [Phenylalanine translase]
E2AK4_HUMAN            6.1.1.21      2.7.11.1     3.0                 Nonspecific serine/threonine protein kinase [Histidine translase]

(a,b) Protein accessions and their true EC numbers are obtained from the Swiss-Prot database released in November 2006.
(c) Error rate is the percentage of false predictions for a given EC number.
(d) Official enzyme names for the true EC number (normal font) and the predicted EC number (italic font in the square brackets).
(e) Error rates greater than 10% are underlined.
(f) EC predictions that are in fact correct are highlighted by bold font.
Table IV

Distribution of EC Numbers Incorrectly Assigned to More Than One
Nonenzyme in the Dnz Dataset

Predicted EC  Number of nonenzymes    Number of training
number        incorrectly predicted   enzymes

2.4.1.129     2                       13
2.4.2.7       2                       183
2.7.7.48      4                       432
2.7.7.6       13                      1762
3.1.1.4       2                       270
3.2.1.17      2                       155


3.1.1.4 (phospholipase) for protein PA2H_ZHAMA, which has 82%
sequence similarity with the training enzymes and has active
sites and sequence patterns recorded by PROSITE for that EC
number. However, experimental studies do not show catalytic
activity for that protein.35
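
The rough proportionality noted for Table IV can be checked with simple arithmetic. The sketch below only re-expresses the Table IV counts as false hits per 100 training enzymes; the dictionary layout and the function name are illustrative choices, not part of CatFam.

```python
# (nonenzymes incorrectly predicted, training enzymes), copied from Table IV
TABLE_IV = {
    "2.4.1.129": (2, 13),
    "2.4.2.7": (2, 183),
    "2.7.7.48": (4, 432),
    "2.7.7.6": (13, 1762),
    "3.1.1.4": (2, 270),
    "3.2.1.17": (2, 155),
}


def false_hits_per_100_training(table):
    """Express each false-positive count relative to 100 training enzymes."""
    return {
        ec: round(100.0 * n_false / n_train, 2)
        for ec, (n_false, n_train) in table.items()
    }


for ec, rate in false_hits_per_100_training(TABLE_IV).items():
    print(f"EC {ec}: {rate} per 100 training enzymes")
```

With these counts, EC 2.4.1.129 stands out at about 15 per 100, whereas the other five EC numbers fall between roughly 0.7 and 1.3 per 100.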


Catalytic  function  annotation  for 
whole  genomes 

To evaluate the performance of CatFam for whole genome
annotation, we select two Yersinia genomes [Y. pestis
mediaevalis (ypm) and Y. pseudotuberculosis IP 32953 (yps)]
and 11 category A and B bacterial pathogens listed by the
Centers for Disease Control and Prevention [Bacillus anthracis
Ames Ancestor (bar), Burkholderia mallei ATCC 23344 (bma),
Burkholderia pseudomallei K96243 (bps), Brucella melitensis
16M (bme), Clostridium botulinum Hall (cbh), Coxiella burnetii
RSA 493 (cbu), Francisella tularensis SCHU S4 (ftu),
Rickettsia prowazekii Madrid E (rpr), Salmonella enterica
serovar Typhi CT18 (sty), Vibrio cholerae N16961 (vch), and
Y. pestis CO92 (ype)]. For benchmarking purposes, we consider
the enzyme annotations in the KEGG database
(http://www.genome.jp/kegg/) as the gold standard, since these
annotations combine the results of multiple resources and are
partially curated. Figure 3 shows the CatFam results along
with the predictions obtained with PRIAM and EFICAz
(http://cssb2.biology.gatech.edu/EFICAz/). In this test, we
use the genome-oriented release of PRIAM, which is slightly
different from the gene-oriented release used in the tests
discussed earlier. EFICAz’s predictions and KEGG’s annotations
were downloaded directly from their Web sites in September
2007. Here, we use the CatFam database with FPR preset to 10%
because it provides a good trade-off between precision and
recall. Figure 3(a) shows the fraction of catalytic function
predictions that agree with KEGG’s annotations, that is, it
provides a measure of precision, whereas Figure 3(b) shows the
fraction of KEGG’s catalytic function annotations that are
predicted by each method, that is, recall.
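
The two measures plotted in Figure 3 reduce to set operations. The following is a minimal sketch, assuming that both the predictions and the gold-standard annotations are available as (gene, EC number) pairs; the function name and the toy identifiers are hypothetical and do not come from CatFam, PRIAM, EFICAz, or KEGG.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted annotations against a gold standard.

    Both arguments are iterables of hashable items, e.g. (gene_id, ec_number).
    Precision: fraction of predictions that agree with the gold standard.
    Recall: fraction of gold-standard annotations that are predicted.
    """
    predicted, gold = set(predicted), set(gold)
    agree = predicted & gold
    precision = len(agree) / len(predicted) if predicted else 0.0
    recall = len(agree) / len(gold) if gold else 0.0
    return precision, recall


# Toy example with made-up gene identifiers
predicted = {("g1", "1.1.1.27"), ("g2", "2.7.1.1"), ("g3", "3.1.1.4"), ("g4", "6.1.1.1")}
gold = {("g1", "1.1.1.27"), ("g2", "2.7.1.1"), ("g5", "4.1.1.23")}
print(precision_recall(predicted, gold))
```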

Comparison of the three methods indicates that the CatFam
predictions yield the highest precision for all 13 genomes
[Figure 3(a)]. EFICAz’s precision is almost as good as
CatFam’s, both of which are substantially better than that of
PRIAM. All of CatFam’s precision values are in the 70-80%
range except for C. botulinum. This genome was recently
sequenced, and only 21 of its proteins are recorded in the
most recent Swiss-Prot database. The lack of appropriate
training proteins may explain why both CatFam and PRIAM reach
their lowest precision values, which are about 55 and 40%,
respectively, for this genome. The EFICAz Web site does not
provide predictions for C. botulinum. Figure 3(b) shows that
CatFam yields the highest recall for seven genomes and that in
three cases its recall is substantially lower than that of
PRIAM. This is consistent with the fact that PRIAM often
predicts many more enzymes than the other two methods,
increasing recall at the expense of precision. Compared with
PRIAM, both CatFam and EFICAz are more conservative tools,
optimized for accurate enzyme function predictions.

Despite the overall comparable performance of CatFam and
EFICAz, we observe substantial differences when comparing the
predictions for each of the 13 bacterial genomes. Similar
differences are observed when comparing with PRIAM’s
predictions as well. For example, the Venn diagram in Figure 4
shows the overlap of the three methods’ catalytic function
predictions for Y. pestis CO92 (ype). Although the majority of
the three methods’
Figure 3

Comparison of catalytic function predictions based on CatFam,
EFICAz, and PRIAM for 13 bacterial genomes, using KEGG as the
gold standard. (a) Shows the fraction of catalytic function
predictions that agree with KEGG’s annotations, that is,
precision. (b) Shows the fraction of KEGG’s catalytic function
annotations that are predicted by each method, that is,
recall. The rightmost bars in each of the two panels indicate
the average values over the 13 genomes. [Color figure can be
viewed in the online issue, which is available at
www.interscience.wiley.com.]

Figure 4

The Venn diagram of catalytic function predictions based on
CatFam, EFICAz, and PRIAM for Y. pestis CO92 (ype). The number
of common predictions is labeled in each intersecting area.
The number of predictions solely provided by each method is
labeled in each nonintersecting area. The total number of
annotations from each method is labeled outside of the
diagram.

predictions yield the same functions (631), each provides
additional catalytic function predictions that are not
inferred by the others. PRIAM provides the largest number of
unique predictions (331), constituting 30.0% of its total
predictions. Between CatFam and EFICAz, about 19.0% of the
predictions from one method are not provided by the other. We
observe similar results for all bacterial genomes except for
C. botulinum. On average, 21.9% of CatFam’s predictions are
not inferred by EFICAz and 26.0% of EFICAz’s predictions are
not inferred by CatFam. False predictions may contribute to
some of these unique predictions. However, we expect the
majority of these differences to be attributable to slight
methodological differences, especially for the differences
between CatFam and EFICAz, which are designed for making
highly accurate predictions.
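
The region counts of a three-way Venn diagram such as Figure 4, and the pairwise "not inferred by the other method" fractions quoted above, can be derived from three prediction sets. The sketch below uses hypothetical function names and toy data; it does not reproduce the actual prediction sets.

```python
def venn3_counts(a, b, c):
    """Count the items in each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "abc": len(a & b & c),
        "ab_only": len((a & b) - c),
        "ac_only": len((a & c) - b),
        "bc_only": len((b & c) - a),
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
    }


def fraction_not_in(a, b):
    """Fraction of the items in `a` that are absent from `b`."""
    a, b = set(a), set(b)
    return len(a - b) / len(a) if a else 0.0


# Toy example: integers stand in for (gene, EC number) predictions
method_a, method_b, method_c = {1, 2, 3, 4}, {2, 3, 5}, {3, 4, 6, 7}
print(venn3_counts(method_a, method_b, method_c))
print(fraction_not_in(method_a, method_b))
```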


Automated  metabolic  pathway 
reconstruction 

Reconstruction of an organism’s metabolic pathways is a key
element for understanding the meaning of protein functions
within a cellular context. Manual curation is the best way to
obtain high-quality metabolic pathways, but it is labor
intensive and time consuming. Several tools for automated
metabolic pathway reconstruction have been developed
(http://www.pathguide.org). However, the quality of the
reconstructed pathways is highly dependent on the precision
and extent to which the organism’s enzymatic functions are
known or predicted. We reconstruct the metabolic pathways of
two organisms, Y. pestis CO92 and F. tularensis SCHU S4,
employing the Pathway Tools software. Initially, GenBank
(http://www.ncbi.nlm.nih.gov/Genbank/) is used as the sole
source of enzyme function annotation, and then it is employed
in combination with each of the three methods (CatFam, PRIAM,
and EFICAz).

Table V compares the total number of reactions and pathways
predicted by the PathoLogic module of Pathway Tools. As
expected, the number of predicted reactions and pathways
increases as automated enzyme annotations are added to
GenBank. The number of predicted pathways is largest when the
PRIAM predictions are added to GenBank because, usually, PRIAM
provides the largest number of enzyme function predictions for
a given genome. However, according to the analysis discussed
earlier, some of these predictions are false positives.
Biasing the reconstruction toward false positives may be
desirable to provide more information to manual curators. For
such an application, PRIAM is a valuable tool and is already
in use. A combination of multiple prediction tools with
complementary enzyme coverage is also valuable, as
demonstrated by the joint predictions of CatFam, PRIAM, and
EFICAz with GenBank. Combined, when compared against GenBank
alone, they increase the number of predicted enzyme-catalyzed
reactions for Y. pestis and F. tularensis by 22.6 and 19.8%,
respectively (Table V). When manual curation


Table V

Comparison of Predicted Pathways for Y. pestis and F. tularensis

                                                                    Number of
                                                                    enzyme-catalyzed  Number of           Number of
Organism                       Annotation                           reactions         predicted pathways  pathways with holes

Yersinia pestis CO92           GenBank                              1000              239                 142
                               GenBank + CatFam                     1060              254                 133
                               GenBank + PRIAM                      1186              278                 145
                               GenBank + EFICAz                     1088              261                 141
                               GenBank + CatFam + PRIAM + EFICAz    1226              284                 145
Francisella tularensis SCHU S4 GenBank                              717               169                 106
                               GenBank + CatFam                     745               178                 107
                               GenBank + PRIAM                      818               188                 113
                               GenBank + EFICAz                     754               181                 106
                               GenBank + CatFam + PRIAM + EFICAz    859               194                 114

Various enzyme annotations are used for reconstruction: GenBank, GenBank enhanced by CatFam predictions, GenBank enhanced by PRIAM predictions, GenBank
enhanced by EFICAz predictions, and GenBank enhanced by all of the three automated prediction methods.
is not available or not feasible because of the large number
of sequenced genomes, more precise prediction tools, such as
EFICAz and CatFam, are more appropriate. The number of
predicted pathways for these tools is similar, with
approximately 95% overlap.
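
The relative gains from combining all three methods with GenBank can be recomputed directly from the Table V counts. In the sketch below, the data layout and the helper name `percent_increase` are illustrative assumptions; only the counts come from Table V. The 22.6 and 19.8% figures correspond to the enzyme-catalyzed reaction column, while the pathway column gives smaller increases.

```python
# (enzyme-catalyzed reactions, predicted pathways), copied from Table V
TABLE_V = {
    "Y. pestis CO92": {"GenBank": (1000, 239), "all three": (1226, 284)},
    "F. tularensis SCHU S4": {"GenBank": (717, 169), "all three": (859, 194)},
}


def percent_increase(base, combined):
    """Relative increase of `combined` over `base`, in percent (one decimal)."""
    return round(100.0 * (combined - base) / base, 1)


for organism, rows in TABLE_V.items():
    (r0, p0), (r1, p1) = rows["GenBank"], rows["all three"]
    print(organism,
          "reactions:", percent_increase(r0, r1),
          "pathways:", percent_increase(p0, p1))
```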

More than 50% of the pathways contain enzymes without assigned
genes, the so-called “pathway holes.” Some of the pathway
holes are filled as automated enzyme annotations are provided
by the three methods, but many remain to be filled, perhaps by
using the pathway hole filler module of PathoLogic.38,39 Fully
automated application of the pathway hole filler introduces
additional false positives and is not used for pathway
reconstructions in this analysis.

DISCUSSION 

The presented results indicate that the CatFam enzyme profiles
are effective in discriminating enzymes from nonenzymes, and
in predicting a broad range of protein catalytic functions.
Although the issue of multi-domain, multi-functional enzymes
is not specifically considered in the profile generation
process, different domain combinations are represented by
distinct profiles, enabling the prediction of catalytic
functions for multi-functional enzymes.

We observe that four-digit EC numbers do not always classify
catalytic functions into distinct categories. A catalytic
function classified by one EC number may be subsumed by
another function classified by a different EC number.
Therefore, we argue that eight of 58 false predictions from a
testing set of 7600 enzymes are in fact correct. These include
two predicted EC numbers for one protein whose annotation was
revised in the most recent Swiss-Prot database, suggesting
that CatFam may be robust to under-annotation errors in
Swiss-Prot. Conversely, our analysis reveals CatFam’s
limitations in distinguishing enzymes with very similar
catalytic functions. Enzymes that catalyze the same type of
reaction but act on different, yet very similar, substrates or
require different cofactors are difficult to distinguish and
may be missed by CatFam. In addition, CatFam occasionally
predicts EC numbers for nonenzymes that are homologous to
known enzymes but do not possess active sites. Furthermore,
precision control through FPRs may also give rise to a
relatively large number of false predictions for EC numbers
that are overrepresented in the training dataset.

CONCLUSIONS 

We present a new method termed CatFam that generates enzyme
sequence profiles to infer protein catalytic functions. The
method provides a procedure for specifying the nominal FPR of
each profile, thereby controlling the reliability of the
predicted protein functions. This enables the generation of
profile databases not only for highly precise function
annotation but also for moderately precise annotation with
better recall, which can be useful for generating hypothetical
protein functions. The use of profile-specific thresholds also
ensures equal precision for each profile and avoids the
problem of having a single E-value threshold for all profiles,
which yields good overall results but poor performance for
some profiles.
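
The idea of a profile-specific threshold tied to a nominal FPR can be illustrated with a small sketch. This is not the CatFam procedure: it assumes that each profile assigns a score to every training enzyme (positives) and every nonenzyme (negatives), that higher scores mean better matches, and that the FPR is the fraction of hits at or above the cutoff that are negatives. The function name `pick_threshold` and these definitions are assumptions made only for illustration.

```python
def pick_threshold(pos_scores, neg_scores, nominal_fpr):
    """Return the most permissive cutoff reached before the nominal FPR is exceeded.

    Assumed FPR definition: FP / (TP + FP) among sequences scoring >= cutoff.
    Candidate cutoffs are scanned from the highest observed score downward,
    and the scan stops at the first cutoff that violates the nominal FPR.
    Returns None if even the strictest cutoff violates it.
    """
    best = None
    for cut in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        tp = sum(s >= cut for s in pos_scores)
        fp = sum(s >= cut for s in neg_scores)
        if fp / (tp + fp) > nominal_fpr:
            break
        best = cut
    return best


# Toy scores for one hypothetical profile
positives = [90, 80, 70, 60]
negatives = [65, 50, 40]
print(pick_threshold(positives, negatives, 0.01))  # strict cutoff
print(pick_threshold(positives, negatives, 0.25))  # more permissive cutoff
```

Under these assumptions, a larger nominal FPR yields a lower cutoff for the same profile, trading precision for recall, and each profile receives its own cutoff rather than sharing one global value.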

Comparisons with well-established resources demonstrate the
effectiveness of the enzyme profile generation method and the
CatFam databases. They not only achieve overall excellent
precision and recall but also perform well for enzymes with
low sequence identity. Comparisons based on three testing
datasets and 13 bacterial genomes consistently indicate that
CatFam outperforms PRIAM in precision and, most of the time,
in recall as well. In addition to various improvements in the
profile generation procedure, use of negative samples and
profile-specific thresholds may be the major contributors to
CatFam’s superior performance. This is supported by the
consistently high precision of CatFam in discriminating
enzymes from nonenzymes, whereas PRIAM’s performance
deteriorates in such applications.

Overall, comparisons between CatFam and EFICAz on whole-genome
annotation examples indicate very similar performance. This
could be attributed to the similar procedure used to generate
the CatFam databases and one of EFICAz’s databases. However,
the predictions do not completely overlap. On average, 21.9%
of the catalytic function predictions inferred by CatFam for
13 bacterial genomes (excluding C. botulinum, for which EFICAz
does not provide predictions) are not inferred by EFICAz,
whereas 26.0% of EFICAz’s predictions are not inferred by
CatFam. This is perhaps due to methodological or training
dataset differences in the profile generation. Although
further comparisons may reveal when each method performs best,
for now, it seems appropriate to use them complementarily,
even combined with PRIAM, for more comprehensive enzyme
annotation. We observe a roughly 20% increase in coverage in
the reconstruction of metabolic pathways when we combine the
predictions of all three methods.

DISCLAIMER 

The  opinions  and  assertions  contained  herein  are  the 
private  views  of  the  authors  and  are  not  to  be  construed 
as  official  or  as  reflecting  the  views  of  the  US  Army  or  of 
the  US  Department  of  Defense.  This  article  has  been 
approved  for  public  release  with  unlimited  distribution. 